
    A runtime heuristic to selectively replicate tasks for application-specific reliability targets

    In this paper we propose a runtime-based selective task replication technique for task-parallel high performance computing applications. Our selective task replication technique is automatic and does not require modification or recompilation of the OS, compiler, or application code. Our heuristic, which we call App_FIT, selects tasks to replicate such that the specified reliability target for an application is achieved. In our experimental evaluation, we show that the App_FIT selective replication heuristic is low-overhead and highly scalable. In addition, the results indicate that complete task replication is overkill for achieving reliability targets. We show that with App_FIT, we can tolerate pessimistic exascale error rates with only 53% of the tasks being replicated. This work was supported by the FI-DGR 2013 scholarship and the European Community's Seventh Framework Programme [FP7/2007-2013] under the Mont-Blanc 2 Project (www.montblanc-project.eu), grant agreement no. 610402, and in part by the European Union (FEDER funds) under contract TIN2015-65316-P. Peer reviewed. Postprint (author's final draft).
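
    The abstract does not detail App_FIT's internals, but a minimal sketch of the general idea, greedily replicating the tasks with the highest failure-in-time (FIT) contribution until the application-level reliability target is met, might look as follows. The FIT values, the target, and the assumption that replication fully masks a task's errors are all illustrative, not taken from the paper.

    ```cpp
    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <numeric>
    #include <vector>

    // Greedily replicate the highest-FIT tasks until the projected
    // application failure rate meets the target (illustrative model:
    // a replicated task's errors are assumed fully masked).
    std::vector<std::size_t> select_tasks(const std::vector<double>& task_fit,
                                          double target_fit) {
        std::vector<std::size_t> order(task_fit.size());
        std::iota(order.begin(), order.end(), std::size_t{0});
        std::sort(order.begin(), order.end(),
                  [&](std::size_t a, std::size_t b) { return task_fit[a] > task_fit[b]; });

        double app_fit = std::accumulate(task_fit.begin(), task_fit.end(), 0.0);
        std::vector<std::size_t> replicated;
        for (std::size_t t : order) {
            if (app_fit <= target_fit) break;   // reliability target met
            app_fit -= task_fit[t];             // this task's errors now masked
            replicated.push_back(t);
        }
        return replicated;
    }

    int main() {
        std::vector<double> fit = {5.0, 1.0, 3.0, 0.5};  // per-task FIT estimates
        auto chosen = select_tasks(fit, 2.0);
        std::printf("replicating %zu of %zu tasks\n", chosen.size(), fit.size());
    }
    ```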

    Diluting the Scalability Boundaries: Exploring the Use of Disaggregated Architectures for High-Level Network Data Analysis

    Traditional data centers are designed with a rigid architecture of fit-for-purpose servers that provision resources beyond the average workload in order to deal with occasional peaks of data. Heterogeneous data centers are pushing towards more cost-efficient architectures with better resource provisioning. In this paper we study the feasibility of using disaggregated architectures for intensive data applications, in contrast to the monolithic approach of server-oriented architectures. In particular, we have tested a proactive network analysis system in which the workload demands are highly variable. In the context of the dReDBox disaggregated architecture, the results show that the overhead caused by using remote memory resources is significant, between 66% and 80%, but we have also observed that memory usage is one order of magnitude higher in the stress case than under average workloads. Therefore, dimensioning memory for the worst case in conventional systems results in a notable waste of resources. Finally, we found that, for the selected use case, parallelism is limited by memory. Therefore, using a disaggregated architecture will allow for increased parallelism, which, at the same time, will mitigate the overhead caused by remote memory. Comment: 8 pages, 6 figures, 2 tables, 32 references. Pre-print. The paper will be presented at the IEEE International Conference on High Performance Computing and Communications in Bangkok, Thailand, 18-20 December 2017, and published in the conference proceedings.
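
    As a toy illustration of the closing observation, that parallelism is capped by memory and disaggregation raises the cap, the following sketch compares worker counts under a local memory budget and a pooled one. All sizes are made-up assumptions, not dReDBox measurements.

    ```cpp
    #include <cstdio>

    int main() {
        long per_worker_gb = 24;   // assumed footprint of one analysis worker
        long local_gb = 64;        // memory on one monolithic server
        long pooled_gb = 512;      // memory reachable in a disaggregated rack

        // Worker count is limited by how much memory each worker can claim.
        std::printf("local cap: %ld workers, disaggregated cap: %ld workers\n",
                    local_gb / per_worker_gb, pooled_gb / per_worker_gb);
    }
    ```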

    Designing and modelling selective replication for fault-tolerant HPC applications

    Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Performance Computing (HPC) applications. There are studies that address fail-stop errors and studies that address SDCs; however, few studies address both types of errors together. In this paper we propose a software-based selective replication technique for HPC applications that covers both fail-stop errors and SDCs. Since complete replication of applications can be costly in terms of resources, we develop a runtime-based technique for selective replication. Selective replication provides an opportunity to meet HPC reliability targets while decreasing resource costs. Our technique is low-overhead, automatic, and completely transparent to the user. This work is supported in part by the European Union Mont-Blanc 2 Project (www.montblanc-project.eu), grant agreement no. 610402, and by FEDER funds under contract TIN2015-65316-P. Peer reviewed. Postprint (author's final draft).
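
    A common way to realize task-level SDC detection of this kind is dual execution with output comparison and a tie-breaking third run. The sketch below is a generic illustration under that assumption; `Task` and `run_replicated` are hypothetical names, not the paper's runtime API.

    ```cpp
    #include <cstdio>
    #include <functional>

    using Task = std::function<long(long)>;

    // Executes the task twice and compares results; a mismatch signals a
    // silent data corruption and a third run breaks the tie (majority vote).
    long run_replicated(const Task& task, long input) {
        long r1 = task(input);
        long r2 = task(input);
        if (r1 == r2) return r1;    // outputs agree: accept
        long r3 = task(input);      // disagreement: re-execute once more
        return (r3 == r1) ? r1 : r2;
    }

    int main() {
        Task square = [](long x) { return x * x; };
        std::printf("%ld\n", run_replicated(square, 12));  // prints 144
    }
    ```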

    Disaggregated Computing. An Evaluation of Current Trends for Datacentres

    Next generation data centers will likely be based on the emerging paradigm of disaggregated function-blocks-as-a-unit, departing from the current state of mainboard-as-a-unit. Multiple functional blocks, or bricks, such as compute, memory, and peripherals will be spread through the entire system and interconnected via one or multiple high speed networks. The amount of memory available will be very large, distributed among multiple bricks. This new architecture brings various benefits that are desirable in today's data centers, such as fine-grained technology upgrade cycles, fine-grained resource allocation, and access to a larger amount of memory and accelerators. An analysis of the impact and benefits of memory disaggregation is presented in this paper. One of the biggest challenges when analyzing these architectures is that memory accesses must be modeled correctly in order to obtain accurate results. However, modeling every memory access would generate an overhead so high that it can make simulation unfeasible for real data center applications. A model to represent and analyze memory disaggregation has been designed, and a statistics-based, queueing-based full-system simulator was developed to rapidly and accurately analyze application performance in disaggregated systems. With a mean error of 10%, simulation results pointed out that the network layers may introduce overheads that degrade applications' performance by up to 66%. Initial results also suggest that low memory access bandwidth may degrade applications' performance by up to 20%. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 687632 (dReDBox project) and TIN2015-65316-P (Computacion de Altas Prestaciones VII). Peer reviewed. Postprint (published version).
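
    To give a feel for the kind of estimate such a queueing-based model produces, the sketch below applies a textbook M/M/1 residence-time formula plus a fixed network overhead to a remote memory access. All rates are illustrative assumptions, not parameters from the paper's simulator.

    ```cpp
    #include <cstdio>

    int main() {
        double service_ns  = 100.0;   // assumed remote access service time
        double network_ns  = 500.0;   // assumed fixed network overhead
        double local_ns    = 80.0;    // assumed local DRAM access time
        double utilization = 0.6;     // offered load on the memory link (rho)

        // M/M/1 residence time W = S / (1 - rho), plus the fixed network cost.
        double remote_ns = network_ns + service_ns / (1.0 - utilization);
        std::printf("remote access: %.0f ns, slowdown vs local: %.1fx\n",
                    remote_ns, remote_ns / local_ns);
    }
    ```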

    Unified fault-tolerance framework for hybrid task-parallel message-passing applications

    We present a unified fault-tolerance framework for task-parallel message-passing applications to mitigate transient errors. First, we propose a fault-tolerant message-logging protocol that only requires the restart of the task that experienced the error and transparently handles any message passing interface calls inside the task. In our experiments we demonstrate that our fault-tolerant solution has a reasonable overhead, with a maximum observed overhead of 4.5%. We also show that fine-grained parallelization is important for hiding the overheads related to the protocol as well as the recovery of tasks. Second, we develop a mathematical model that unifies task-level checkpointing and our protocol with system-wide checkpointing in order to provide complete failure coverage. We provide closed formulas for the optimal checkpointing interval and the performance score of the unified scheme. Experimental results show that the performance improvement can be as high as 98% with the unified scheme. The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: this work was supported by the FI-DGR 2013 scholarship and the European Community's Seventh Framework Programme [FP7/2007-2013] under the Mont-Blanc 2 Project (www.montblanc-project.eu), grant agreement no. 610402, and TIN2015-65316-P. Peer reviewed. Postprint (author's final draft).
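
    The closed formulas themselves are not reproduced in the abstract. As background only, the classical first-order approximation for the optimal checkpointing interval (Young/Daly), which unified models of this kind typically refine, is:

    ```latex
    % Classical first-order optimum (Young/Daly), shown as background only;
    % the paper derives its own closed formulas for the unified
    % task-level/system-wide scheme.
    % C: cost of one system-wide checkpoint, M: mean time between failures.
    T_{\mathrm{opt}} \approx \sqrt{2\,C\,M}
    ```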

    QuakeTM: Parallelizing a complex serial application using transactional memory

    'Is transactional memory useful?' is a question that cannot be answered until we provide substantial applications that can evaluate its capabilities. While existing TM applications can partially answer this question, and are useful in the sense that they provide a first-order TM experimentation framework, they serve only as a proof of concept and fail to make a conclusive case for wide adoption by the general computing community. This work presents QuakeTM, a multiplayer game server: a complex, real-life TM application that was parallelized from the serial version with TM-specific considerations in mind. QuakeTM consists of 27,600 lines of code spread among 49 files and exhibits irregular parallelism and coarse-grain transactions with large read and write sets. In spite of its complexity, we show that QuakeTM does scale; however, more effort is needed to decrease the overhead and the abort rate of current software transactional memory systems. We give insights into development challenges, suggest techniques to solve them, and provide an extensive analysis of the transactional behavior of QuakeTM, with an emphasis on, and a discussion of, the TM promise of making parallel programming easy. Postprint (published version).
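
    For readers unfamiliar with the programming model, the toy fragment below shows what a coarse-grain transaction over shared game-server state looks like. It is not QuakeTM code, and it relies on GCC's experimental TM support (build with g++ -fgnu-tm) rather than the STM systems evaluated in the paper.

    ```cpp
    #include <cstdio>

    static int player_health[64];   // shared game-server state (toy example)

    // A whole state update wrapped in one coarse-grain transaction, the
    // granularity the paper reports leads to large read and write sets.
    void apply_damage(int player, int amount) {
        __transaction_atomic {
            player_health[player] -= amount;
            if (player_health[player] < 0) player_health[player] = 0;
        }
    }

    int main() {
        player_health[3] = 100;
        apply_damage(3, 40);
        std::printf("player 3 health: %d\n", player_health[3]);  // prints 60
    }
    ```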

    Transaction processing core for accelerating software transactional memory

    Submitted for review to the MICRO-40 conference on 9 June 2007. This paper introduces an advanced hardware-based approach for accelerating Software Transactional Memory (STM). The proposed solution focuses on speeding up conflict detection, which grows polynomially with the number of concurrently running transactions, and shared-to-transaction-local address resolution, which is the most frequent STM operation. This is achieved by logic split across two hardware units: the Transaction Processing Core and the Transactional Memory Look-Aside Buffer. The Transaction Processing Core is a separate hardware unit that performs eager conflict detection and address resolution by indexing transactional objects based on their virtual addresses. The Transactional Memory Look-Aside Buffer is a per-processor extension that caches the addresses translated by the Transaction Processing Core, reducing bus traffic and the time spent on communication between the CPUs and the Transaction Processing Core. Compared with existing solutions, our approach differs mainly in proposing an implementation that is not based on the processor cache but on a separate on-chip core, uses virtual addresses, does not require application modification, and is further enhanced by the Transactional Memory Look-Aside Buffer. Our experiments confirm the potential of the Transaction Processing Core to dramatically speed up STM systems. Postprint (published version).
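
    The hardware design itself cannot be reconstructed from the abstract, but a toy software model of eager conflict detection via an ownership table indexed by hashed virtual addresses, the core mechanism the Transaction Processing Core off-loads, might look like this. The table size, hash, and API are illustrative choices, not the TPC design.

    ```cpp
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    constexpr std::size_t TABLE_BITS = 10;
    static int owner[1u << TABLE_BITS];   // 0 = free, else owning transaction id

    // Index the ownership table by (hashed) virtual address.
    std::size_t slot(std::uintptr_t vaddr) {
        return (vaddr >> 3) & ((1u << TABLE_BITS) - 1);
    }

    // Returns true if txn acquires the address; false on a conflict with
    // another in-flight transaction, detected eagerly at first access.
    bool acquire(int txn, const void* p) {
        std::size_t s = slot(reinterpret_cast<std::uintptr_t>(p));
        if (owner[s] != 0 && owner[s] != txn) return false;  // conflict
        owner[s] = txn;
        return true;
    }

    int main() {
        int x = 0;
        std::printf("txn 1: %s\n", acquire(1, &x) ? "ok" : "conflict");
        std::printf("txn 2: %s\n", acquire(2, &x) ? "ok" : "conflict");
    }
    ```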

    dReDBox: Materializing a full-stack rack-scale system prototype of a next-generation disaggregated datacenter

    Current datacenters are based on server machines, whose mainboard and hardware components form the baseline, monolithic building block that the rest of the system software, middleware, and application stack are built upon. This leads to the following limitations: (a) resource proportionality of a multi-tray system is bounded by the basic building block (mainboard), (b) resource allocation to processes or virtual machines (VMs) is bounded by the resources available within the boundary of the mainboard, leading to spare resource fragmentation and inefficiencies, and (c) upgrades must be applied to each and every server even when only a specific component needs to be upgraded. The dReDBox project (Disaggregated Recursive Datacentre-in-a-Box) addresses the above limitations and proposes next-generation, low-power, across-form-factor datacenters, departing from the paradigm of the mainboard-as-a-unit and enabling the creation of the function-block-as-a-unit. Hardware-level disaggregation and software-defined wiring of resources are supported by a full-fledged Type-1 hypervisor that can execute commodity virtual machines, which communicate over a low-latency, high-throughput software-defined optical network. To evaluate its novel approach, dReDBox will demonstrate application execution in the domains of network functions virtualization, infrastructure analytics, and real-time video surveillance. This work has been supported in part by the EU H2020 ICT project dReDBox, contract #687632. Peer reviewed. Postprint (author's final draft).

    dReDBox: A Disaggregated Architectural Perspective for Data Centers

    Data centers are currently constructed with fixed blocks (blades); the hard boundaries of this approach lead to suboptimal utilization of resources and increased energy requirements. The dReDBox (Disaggregated Recursive Datacenter in a Box) project addresses the problem of fixed resource proportionality in next-generation, low-power data centers by proposing a paradigm shift toward finer resource allocation granularity, where the unit is the function block rather than the mainboard tray. This introduces various challenges at the system design level, requiring elastic hardware architectures, efficient software support and management, and programmable interconnect. Memory and hardware accelerators can be dynamically assigned to processing units to boost application performance, while high-speed, low-latency electrical and optical interconnect is a prerequisite for realizing the concept of data center disaggregation. This chapter presents the dReDBox hardware architecture and discusses design aspects of the software infrastructure for resource allocation and management. Furthermore, initial simulation and evaluation results for accessing remote, disaggregated memory are presented, employing benchmarks from the Splash-3 and CloudSuite benchmark suites. This work was supported in part by the EU H2020 ICT project dReDBox, contract #687632. Peer reviewed. Postprint (author's final draft).

    Unprotected Computing: A Large-Scale Study of DRAM Raw Error Rate on a Supercomputer

    Supercomputers offer new opportunities for scientific computing as they grow in size. However, their growth also poses new challenges. Resilience has been recognized as one of the most pressing issues to solve for extreme scale computing. Transistor scaling in the single-digit nanometer era and power constraints might dramatically increase the failure rate of next generation machines. DRAM errors have been analyzed in the past for different supercomputers, but those studies are usually based on job scheduler logs and counters produced by hardware-level error correcting codes. Consequently, little is known about errors escaping hardware checks, which lead to silent data corruption. This work attempts to fill that gap by analyzing memory errors for over a year on a cluster with about 1000 nodes featuring low-power memory without error correction. The study gathered millions of events recording detailed information about thousands of memory errors, many of them corrupting multiple bits. Several factors are analyzed, such as temporal and spatial correlation between errors, as well as the influence of temperature and even the position of the sun in the sky. The study showed that most multi-bit errors corrupted non-adjacent bits in the memory word and that most errors flipped memory bits from 1 to 0. In addition, we observed thousands of cases of multiple single-bit errors occurring simultaneously in different regions of the memory. These new observations would not be possible by simply analyzing error correction counters on classical systems. We propose several directions in which the findings of this study can help the design of more reliable systems in the future. The research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under the Mont-Blanc 2 Project (www.montblanc-project.eu), grant agreement no. 610402, and has been supported in part by the European Union (FEDER funds) under contract TIN2015-65316-P. Peer reviewed. Postprint (author's final draft).
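
    The study's classification of flip direction (1-to-0 versus 0-to-1) can be illustrated with simple bit arithmetic over an expected/observed word pair. The two words below are made-up examples, not data from the study.

    ```cpp
    #include <bitset>
    #include <cstdint>
    #include <cstdio>

    int main() {
        std::uint64_t expected = 0xFFULL;   // pattern written to memory
        std::uint64_t observed = 0x7DULL;   // pattern read back (with errors)

        std::uint64_t one_to_zero = expected & ~observed;   // bits dropped 1->0
        std::uint64_t zero_to_one = ~expected & observed;   // bits raised 0->1

        std::printf("1->0 flips: %zu, 0->1 flips: %zu\n",
                    std::bitset<64>(one_to_zero).count(),
                    std::bitset<64>(zero_to_one).count());
    }
    ```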